Analyze the assortment of goods based on the transactions made by an online store of household goods.
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
import scipy.stats as stats
df = pd.read_csv('', sep=',')
df.head()
| date | customer_id | order_id | product | quantity | price | |
|---|---|---|---|---|---|---|
| 0 | 2018100100 | ee47d746-6d2f-4d3c-9622-c31412542920 | 68477 | Комнатное растение в горшке Алое Вера, d12, h30 | 1 | 142.0 |
| 1 | 2018100100 | ee47d746-6d2f-4d3c-9622-c31412542920 | 68477 | Комнатное растение в горшке Кофе Арабика, d12,... | 1 | 194.0 |
| 2 | 2018100100 | ee47d746-6d2f-4d3c-9622-c31412542920 | 68477 | Радермахера d-12 см h-20 см | 1 | 112.0 |
| 3 | 2018100100 | ee47d746-6d2f-4d3c-9622-c31412542920 | 68477 | Хризолидокарпус Лутесценс d-9 см | 1 | 179.0 |
| 4 | 2018100100 | ee47d746-6d2f-4d3c-9622-c31412542920 | 68477 | Циперус Зумула d-12 см h-25 см | 1 | 112.0 |
Conclusions:
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6737 entries, 0 to 6736 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 6737 non-null int64 1 customer_id 6737 non-null object 2 order_id 6737 non-null int64 3 product 6737 non-null object 4 quantity 6737 non-null int64 5 price 6737 non-null float64 dtypes: float64(1), int64(3), object(2) memory usage: 315.9+ KB
Conclusions:
# int into string
df['customer_id'] = df['customer_id'].astype('string')
df['product'] = df['product'].astype('string')
# int into datetime
df['date'] = df['date'].astype('string')
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d%H')
# check
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 6737 entries, 0 to 6736 Data columns (total 6 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 date 6737 non-null datetime64[ns] 1 customer_id 6737 non-null string 2 order_id 6737 non-null int64 3 product 6737 non-null string 4 quantity 6737 non-null int64 5 price 6737 non-null float64 dtypes: datetime64[ns](1), float64(1), int64(2), string(2) memory usage: 315.9 KB
# building a histogrsam
df['date'].hist(bins=150);
# naming title and axes
plt.xlabel('time')
plt.ylabel('quantity')
plt.title('Events by date')
plt.show()
display(df['date'].min())
display(df['date'].max())
print("Records cover", df['date'].max() - df['date'].min())
Timestamp('2018-10-01 00:00:00')
Timestamp('2019-10-31 16:00:00')
Records cover 395 days 16:00:00
Conclusions:
df['quantity'].hist(bins=100);
plt.xlabel('quantity of goods')
plt.ylabel('records')
plt.title('Quantity of goods histogram')
plt.show()
# orders IDs histogram
display(df['order_id'].astype('int').hist())
<Axes: >
Conclusions:
Customers tend to buy 1-2 goods, i.e. this is a retail store, and wholesale purchases, if any, need to be excluded from further analysis.
Order numbers are sequential, so it is worth asking the store’s technical support specialists why the numbering is interrupted.
df['price'].hist(bins=100);
plt.xlabel('price')
plt.ylabel('records')
plt.title('Price of goods histogram')
plt.show()
Conclusions:
The price of the purchased goods generally does not exceed 1,000 RUB.
# Y+M+W
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['week'] = df['date'].dt.isocalendar().week
# WD+H
df['dow'] = df['date'].dt.dayofweek + 1
df['hour'] = df['date'].dt.hour
# Y+M
df['ymonth'] = df['year'].astype('string') + df['month'].astype('string')
df['ymonth'] = pd.to_datetime(df['ymonth'], format='%Y%m')
# revenue calculation
df['revenue'] = df['price'] * df['quantity']
# six-digits customer IDs
# unique customers
display(df['customer_id'].nunique())
2451
# column with six-digits customer IDs
df['cid'] = [x.strip()[-7:] for x in df['customer_id']]
# check
display(df['cid'].nunique())
2451
# short product names column
df['prod'] = [x.split()[0] + ' ' + x.split()[1] for x in df['product']]
df.isna().sum()
date 0 customer_id 0 order_id 0 product 0 quantity 0 price 0 year 0 month 0 week 0 dow 0 hour 0 ymonth 0 revenue 0 cid 0 prod 0 dtype: int64
df.duplicated().sum()
0
# let's see how many duplicates there are in our dataset
# the presence of such duplicates may indicate failures in the order system, or automatic orders
display(df.drop_duplicates(subset=['customer_id', 'order_id', 'product', 'quantity', 'price'])['date'].count() / df['date'].count())
0.7233189847112959
Such orders, which make up almost 28% of the dataset, should be excluded from the analysis
# deleting
df = df.drop_duplicates(subset=['customer_id', 'order_id', 'product', 'quantity', 'price'])
# next, we will check the adequacy of prices for goods worth more than 2000 RUB.
# were there any errors when entering these prices into system?
display(df.query('price >= 2000')
.pivot_table(index='prod', values='price', aggfunc='max'))
| price | |
|---|---|
| prod | |
| Tepмокружка AVEX | 2399.0 |
| Автоматическая щетка | 7229.0 |
| Афеляндра скуарроса | 3524.0 |
| Бак для | 3749.0 |
| Ведро для | 3749.0 |
| Весы напольные | 2849.0 |
| Гладильная доска | 7424.0 |
| Гладильная доска-стремянка | 2399.0 |
| Гортензия Микс | 3599.0 |
| Двуспальное постельное | 2024.0 |
| Доска гладильная | 3299.0 |
| Ерш для | 3524.0 |
| Карниз алюминиевый | 2099.0 |
| Коврик для | 5474.0 |
| Коврик придверный | 2009.0 |
| Комплект для | 5399.0 |
| Котел алюминиевый | 2924.0 |
| Мантоварка-пароварка WEBBER | 2219.0 |
| Мусорный контейнер | 5512.0 |
| Набор Vileda | 2924.0 |
| Набор инструментов | 5399.0 |
| Наматрасник Wellness | 3074.0 |
| Новогоднее дерево | 3524.0 |
| Одеяло Wellness | 4724.0 |
| Покрывало жаккард | 6134.0 |
| Полки QWERTY | 4312.0 |
| Пылесос DELTA | 2249.0 |
| Сиденье для | 6149.0 |
| Скатерть 350х150 | 2249.0 |
| Скатерть Арлет | 2174.0 |
| Скатерть Джулия | 2249.0 |
| Стремянка 5 | 3974.0 |
| Стремянка 7 | 7724.0 |
| Стремянка COLOMBO | 3449.0 |
| Стремянка Colombo | 2699.0 |
| Стремянка FRAMAR | 4499.0 |
| Стремянка Scab | 5549.0 |
| Стремянка алюминиевая | 4949.0 |
| Стремянка-табурет алюминиевая | 2699.0 |
| Стремянки Colombo | 3974.0 |
| Сумка-тележка 2-х | 2849.0 |
| Сумка-тележка 3-х | 2699.0 |
| Сумка-тележка TWIN | 2624.0 |
| Сумка-тележка хозяйственная | 8737.0 |
| Сушилка Meliconi | 5594.0 |
| Сушилка для | 7004.0 |
| Сушилка уличная | 14917.0 |
| Урна уличная | 7349.0 |
| Урна-пепельница из | 5287.0 |
| Фал капроновый | 2099.0 |
| Цитрофортунелла Кумкват | 3074.0 |
| Швабра для | 2624.0 |
| Швабра хозяйственная | 3224.0 |
| Штора для | 4424.0 |
We checked the prices of the above goods on the Internet and found that they do not differ significantly from those presented in the dataset
# let’s check if there are any orders attributed to several customers at once
bugged_orders = (
df.pivot_table(index='order_id', values='customer_id', aggfunc='nunique')
.reset_index()
.sort_values('customer_id', ascending=False)
.query('customer_id > 1')['order_id']
.to_list()
)
display(bugged_orders)
[72845, 71480, 69485, 69310, 69833, 72790, 72778, 14872, 71542, 71054, 71663, 70726, 69531, 70542, 70903, 69283, 71226, 71571, 69410, 69345, 70808, 70114, 70631, 71461, 72950, 71648, 70946, 68785, 72188]
# % of such orders
display(df.query('order_id in @bugged_orders')['order_id'].count() / df['order_id'].count())
0.013544018058690745
1,3%. To be deleted
# deleting
df = df.query('order_id not in @bugged_orders')
Conclusions:
We preprocessed the data, clearing the dataset from possible errors and duplicates We reduced the data to convenient types and added columns that will help us in further analysis
First, let’s take a look at the distribution of the quantity of purchased goods in order to exclude wholesale purchases from the analysis.
sns.boxplot(x=df['quantity'], orient='h')
plt.xlabel('quantity of goods, in pcs')
plt.title('Quantity of goods boxplot')
plt.show()
# detailed boxplot
sns.boxplot(x=df['quantity'], orient='h')
plt.axis([0, 200, None, None])
plt.xlabel('quantity of goods, in pcs')
plt.title('Detailed quantity of goods boxplot')
plt.show()
# wholesale goods
display(df.query('quantity > 50')
.pivot_table(index='prod', values='quantity', aggfunc='sum')
.reset_index()
.sort_values('quantity', ascending=False)
.head(15)
.plot(x='prod', y='quantity', kind='bar'));
plt.xlabel('Product')
plt.ylabel('quantity in pcs')
plt.title('Store sells the following wholesale goods')
plt.show()
<Axes: xlabel='prod'>
Conclusions:
There are many wholesale purchases in the dataset. Taking into account that the store’s assortment includes many small and/or paired items (forks, seeds, hooks etc.), we consider purchases of 50 pieces of the same product as wholesale
# % of wholesale
display(len(df.query('quantity > 50')) / len(df))
# wholesale revenue
display(df.query('quantity > 50')['revenue'].sum())
0.004784688995215311
880158.0
# checking that most revenue comes from the sale of single goods
# (there can be several records under one order ID)
revenue_top_quantity = (df.pivot_table(index='quantity', values='revenue', aggfunc='sum')
.reset_index()
.sort_values('revenue', ascending=False)
)
revenue_top_quantity.head()
| quantity | revenue | |
|---|---|---|
| 0 | 1 | 2258065.0 |
| 48 | 1000 | 675000.0 |
| 1 | 2 | 228022.0 |
| 9 | 10 | 107450.0 |
| 2 | 3 | 86046.0 |
# let's take a look at wholesale revenue trend before deleting the records
revenue_by_month = (df.query('quantity > 50')
.pivot_table(index='ymonth', values='revenue', aggfunc='sum')
.reset_index()
.sort_values('ymonth', ascending=True)
)
revenue_by_month.plot(x='ymonth', y='revenue', kind='bar')
plt.xlabel('month')
plt.ylabel('revenue in RUB')
plt.title('Wholesale revenue by month')
plt.show()
Therefore, wholesale purchases account for 0.5% of the data volume. From the table and monthly revenue chart we see that 700 kRUB of wholesale revenue comes from a single abnormal transaction recorded in June 2019.
Having excluded this sale, as well as transactions of more than 50 product pieces, we obtained adequate data on retail purchases
# exclusion of wholesale and abnormal records
df = df.query('quantity <= 50')
# prices boxplot
sns.boxplot(x=df['price'], orient='h')
plt.xlabel('price in RUB')
plt.title('Prices boxplot')
plt.show()
# detailed prices boxplot
sns.boxplot(x=df['price'], orient='h')
plt.axis([0, 2000, None, None])
plt.xlabel('Цена, у.е.')
plt.title('Detailed prices boxplot')
plt.show()
Conclusions:
Most goods are sold at prices ranging from 50 to 500 RUB a piece. Abnormal price starts at 1,000 RUB.However, we have previously analyzed prices above 2,000 RUB and concluded that there were no input errors or suspiciously high prices.
Therefore, we do not exclude additional data from the analysis.
# number of unique customers
display(df['customer_id'].nunique())
display(df['order_id'].nunique())
# average number of orders made by one unique customer
display(df['order_id'].nunique() / df['customer_id'].nunique())
2375
2734
1.1511578947368422
# total revenue
display(df['revenue'].sum())
# revenue by one unique customer
display(df['revenue'].sum() / df['customer_id'].nunique())
3214457.0
1353.4555789473684
# products quantity
display(df['product'].nunique())
# products in average order
display(df['product'].count() / df['order_id'].nunique())
2318
1.7498171177761521
Conclusions:
# let's add a file with the "product" - "category" pairs that we have entered as a separate dataset
df_cat = pd.read_csv(r'C:\Users\kvkoz\Downloads\cat1.csv', sep=',')
df_cat.head()
| product | cat | |
|---|---|---|
| 0 | Комнатное растение в горшке Алое Вера, d12, h30 | растения |
| 1 | Комнатное растение в горшке Кофе Арабика, d12,... | растения |
| 2 | Радермахера d-12 см h-20 см | растения |
| 3 | Хризолидокарпус Лутесценс d-9 см | растения |
| 4 | Циперус Зумула d-12 см h-25 см | растения |
# category column
df = pd.merge(df, df_cat, on='product')
Let us perform a detailed data analysis so as to define nuances in sales trends, and come up with recommendations for the store management
# revenue and selling volumes diagram
dbm = df.groupby(['ymonth'])[['revenue','quantity']].sum()
fig = plt.figure()
ax = fig.add_subplot(111)
ax2 = ax.twinx()
width = 0.3
dbm.revenue.plot(kind='bar', color='magenta', ax=ax, width=width, position=1, label='revenue')
dbm.quantity.plot(kind='bar', color='cyan', ax=ax2, width=width, position=0, label='quantity')
plt.legend(loc="upper right")
ax.set_ylabel('revenue in CU')
ax2.set_ylabel('volume in pcs')
plt.title('Sales trends')
plt.show()
We are analysing the remaining sales of 2.8 mCU, cleared of wholesale and abnormal transactions.
The distribution of sales shows that peak sales were observed in autumn of 2018 and were gradually decreasing after: by the end of the analysed period, sales fell by 1.5 times, and in October 2019 sales were almost half as much as in October 2018.
At the same time, sales declines were observed in January and June. The January decline is possible due to long holidays, and the June decline may have occured due to the vacation season.
The distribution of volumes of goods sold basically repeats the distribution of revenue, with the exception of two points:
Let's see how revenue is distributed among customers and orders.
# revenue and sales volume by top-10 customers
dbc = df.groupby(['cid'])[['revenue','quantity']].sum().sort_values('quantity', ascending=False).head(10)
fig = plt.figure()
ax = fig.add_subplot(111)
ax2 = ax.twinx()
width = 0.3
dbc.revenue.plot(kind='bar', color='magenta', ax=ax, width=width, position=1, label='revenue')
dbc.quantity.plot(kind='bar', color='cyan', ax=ax2, width=width, position=0, label='quantity')
# добавляем названия
plt.legend(loc="upper right")
ax.set_ylabel('revenue in CU')
ax2.set_ylabel('quantity in pcs')
plt.title('Revenue and sales volume by top-10 customers')
plt.show()
We see that the two largest buyers account for 7% of all sales, but next ones account for less than 2%. They are also the largest buyers in terms of quantity of goods.
Let's see how often they made purchases
# revenue by month
display(df.query('cid in ["ee86d3b","e40d7db"]')
.pivot_table(index='ymonth', values='quantity', aggfunc='sum')
.reset_index()
.sort_values('ymonth', ascending=True)
.plot(x='ymonth', y='quantity', kind='bar')); # строим диаграмму
plt.xlabel('month')
plt.ylabel('quantity')
plt.title('Customer ee86d3b and e40d7db purchases')
plt.show()
<Axes: xlabel='ymonth'>
They made purchases up until March 2019, but no further. These are probably lost clients.
How are goods distributed among orders?
display(df.pivot_table(index='order_id', values='quantity', aggfunc='sum')
.reset_index()
.sort_values('quantity', ascending=False)
.head(10)
.plot(x='order_id', y='quantity', kind='bar')
)
plt.xlabel('order number')
plt.ylabel('quantity in pcs')
plt.title('Top-10 orders by quantity of goods sold')
plt.show()
<Axes: xlabel='order_id'>
Since we excluded wholesale orders, the fluctuation in the number of goods in the remaining orders is not significant.
Let's move on to the analysis of popular products
# top-10 products
display(df.pivot_table(index='prod', values='quantity', aggfunc='sum')
.reset_index()
.sort_values('quantity', ascending=False)
.head(10)
.plot(x='prod', y='quantity', kind='bar')
)
plt.xlabel('product name')
plt.ylabel('quantity in pcs')
plt.title('Top-10 products')
plt.show()
<Axes: xlabel='prod'>
From the graph above we can conclude that the store sells mostly plants and flowers, including artificial ones.
Let's see which product category is in greatest demand.
To do this, we assigned each of the products one of the following six categories: clothing and shoes, plants, home and everyday life, kitchen utensils, storage and transportation etc.
# revenue and sales volumes by category
plotpie = (df.pivot_table(index='cat', values=['revenue','quantity'], aggfunc='sum')
.reset_index()
.sort_values('revenue', ascending=False)
)
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(12,12))
labels = plotpie['cat']
values = plotpie['revenue']
ax1.pie(values,labels = labels,autopct = '%1.1f%%') #plot first pie
ax1.set_title('Revenue by category')
labels = plotpie['cat']
values = plotpie['quantity']
ax2.pie(values,labels = labels,autopct = '%1.1f%%') #plot second pie
ax2.set_title('Volumes of goods sold by category')
plt.show();
Revenue and quantity of goods are distributed unevenly.
Firstly, 44% of sales come from plants, but they only account for 16% of revenue. It is likely that plants are not purchased all year round, so it is important to fill storages with them just before peak demand.
Secondly, the smallest category, “storage and transportation,” accounts for more than 21% of revenue. It is followed by clothing and shoes and household goods, each of which is responsible for a fifth of revenue.
Finally, kitchen and other products account for only 18% of revenue, marking 25% of sales. Perhaps the store should refrain from purchasing goods of such categories in favor of more profitable ones.
Let's see how sales are distributed over time: first by month, and then by day of week and hour
catsum = df.groupby(['ymonth','cat'])[['quantity']].sum()
catsum.columns = ['total_quantity']
fig = px.bar(catsum.reset_index().sort_values(by=['total_quantity'], ascending=False),
x='ymonth',
y='total_quantity',
color='cat',
)
fig.update_layout(title='Product sales by category',
xaxis_title='quantity in pcs',
yaxis_title='month',
yaxis={'categoryorder':'total ascending'})
fig.show()
Seeking seasonality in sales
We confirmed our hypothesis that plant sales are determined by seasonality. Peak sales occur in spring, or April-May, when the planting season begins. The same fact responds to our earlier observation that sales peaks are observed in April-May. This information will be useful for the store to effectively manage warehouse space.
Sales of clothing and shoes, on the contrary, logically grow in cold autumn and winter months, tnen decrease towards summer. Same applies to kitchen and other products.
There is no pronounced seasonality in sales of household goods and containers, but they decline towards the end of the year. It is difficult to make an unambiguous conclusion about the long-term trend based on data for October 2018 and October 2019, but for all product categories, except for plants and containers, sales fell by 1.5-4 times by October 2019. This may indicate the store's low competitiveness in all categories except plants.
Now let’s build hitmaps to analyze sales by day of week and hour
heatmap_catd = (df.pivot_table(index=['cat','dow','hour'], values='quantity', aggfunc='sum')
.reset_index()
.sort_values('dow', ascending=True)
)
heatmap_catd
| cat | dow | hour | quantity | |
|---|---|---|---|---|
| 0 | дом и быт | 1 | 0 | 4 |
| 249 | одежда и обувь | 1 | 15 | 6 |
| 250 | одежда и обувь | 1 | 16 | 3 |
| 251 | одежда и обувь | 1 | 17 | 9 |
| 252 | одежда и обувь | 1 | 18 | 18 |
| ... | ... | ... | ... | ... |
| 603 | растения | 7 | 6 | 3 |
| 604 | растения | 7 | 8 | 46 |
| 469 | прочее | 7 | 12 | 8 |
| 231 | кухонные принадлежности | 7 | 16 | 8 |
| 740 | хранение и перевозка | 7 | 23 | 14 |
741 rows × 4 columns
plants_hm = (heatmap_catd.query('cat == "растения"')
.pivot(index='hour', columns='dow', values='quantity')
)
fig, ax = plt.subplots(figsize=(5,5))
plt.title('Sales of plants by weekday and hour')
display(sns.heatmap(plants_hm, annot=True, fmt='.0f'))
<Axes: title={'center': 'Sales of plants by weekday and hour'}, xlabel='dow', ylabel='hour'>
clothes_hm = (heatmap_catd.query('cat == "одежда и обувь"')
.pivot(index='hour', columns='dow', values='quantity')
)
fig, ax = plt.subplots(figsize=(5,5))
plt.title('Sales of clothes and shoes by weekday and hour')
display(sns.heatmap(clothes_hm, annot=True))
<Axes: title={'center': 'Sales of clothes and shoes by weekday and hour'}, xlabel='dow', ylabel='hour'>
appliances_hm = (heatmap_catd.query('cat == "дом и быт"')
.pivot(index='hour', columns='dow', values='quantity')
)
fig, ax = plt.subplots(figsize=(5,5))
plt.title('Household items sales by weekday and hour')
display(sns.heatmap(appliances_hm, annot=True, fmt='.0f'))
<Axes: title={'center': 'Household items sales by weekday and hour'}, xlabel='dow', ylabel='hour'>
storage_hm = (heatmap_catd.query('cat == "хранение и перевозка"')
.pivot(index='hour', columns='dow', values='quantity')
)
fig, ax = plt.subplots(figsize=(5,5))
plt.title('Storage items sales by weekday and hour')
display(sns.heatmap(storage_hm, annot=True))
<Axes: title={'center': 'Storage items sales by weekday and hour'}, xlabel='dow', ylabel='hour'>
In all cases, there are virtually no sales at night until 9 am. The peak occurs between 9 and 15 hours, and plants are actively purchased even after 15.
In addition to sales of containers, which are distributed more or less evenly throughout the day, most sales occur on Monday-Wednesday.
Since most orders are made during daytime, discounts could be introduced for orders placed late in the evening to smooth out the load on operators.
Let's see if we can separate categories of goods that are bought together with other goods from those bought separately.
Main and additional product assortments
# order ID and unique goods
orders_ops = (df.pivot_table(index='order_id', values='prod', aggfunc='nunique')
.reset_index()
.sort_values('prod', ascending=False)
)
# int
orders_ops['prod'] = orders_ops['prod'].astype('int')
# we leave only orders in which one unique product is purchased
orders_ops = orders_ops['order_id'].loc[orders_ops['prod'] == 1]
# series to list
orders_ops = orders_ops.tolist()
# add a column to the dataset that returns 1 if the order contains one unique product
df['sole_order'] = [x in orders_ops for x in df['order_id']]
df['sole_order']
0 False
1 False
2 False
3 False
4 False
...
4778 True
4779 True
4780 True
4781 True
4782 True
Name: sole_order, Length: 4783, dtype: bool
# sales volume by category
display(df.query('sole_order == True')
.pivot_table(index='cat', values='quantity', aggfunc='sum')
.reset_index()
.sort_values('quantity', ascending=False)
.plot(x='cat', y='quantity', kind='bar'));
plt.xlabel('category')
plt.ylabel('quantity in pcs')
plt.title('If an order contains one product, what category does it belong to?')
plt.show()
<Axes: xlabel='cat'>
# sales volume by category
display(df.query('sole_order == False')
.pivot_table(index='cat', values='quantity', aggfunc='sum')
.reset_index()
.sort_values('quantity', ascending=False)
.head(15)
.plot(x='cat', y='quantity', kind='bar'));
plt.xlabel('category')
plt.ylabel('quantity in pcs')
plt.title('If an order contains several products, what category do they belong to?')
plt.show()
<Axes: xlabel='cat'>
So, we have the following ratio of the main and additional assortment for each category:
1. Plants: 52% main and 48% additional
2. Household items: 84% and 16%
3. Other: 82% and 18%
4. Kitchen apps: 82% and 18%
5. Clothing and shoes: 85% and 15%
6. Storage and transportation: 91% and 9%
Let's see what products are included in the main and additional range.
# top-3 products of each category
for i in df['cat'].unique():
cats = (df.query('cat == @i')
.groupby(['sole_order','cat','prod'])[['quantity']].sum()
.sort_values(by=['cat','quantity'], ascending=False)
.reset_index()
)
for k in df['sole_order'].unique():
display(cats.query('sole_order == @k').head(3))
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 3 | False | растения | Пеларгония зональная | 239 |
| 5 | False | растения | Пеларгония розебудная | 156 |
| 6 | False | растения | Рассада зелени | 149 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 0 | True | растения | Искусственный цветок | 323 |
| 1 | True | растения | Цветок искусственный | 298 |
| 2 | True | растения | Пеларгония зональная | 249 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 10 | False | одежда и обувь | Вешалка деревянная | 20 |
| 11 | False | одежда и обувь | Вешалка для | 19 |
| 12 | False | одежда и обувь | Плечики пластмассовые | 19 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 0 | True | одежда и обувь | Сушилка для | 278 |
| 1 | True | одежда и обувь | Гладильная доска | 125 |
| 2 | True | одежда и обувь | Вешалка для | 81 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 5 | False | дом и быт | Щетка для | 52 |
| 7 | False | дом и быт | Щетка-утюжок с | 50 |
| 15 | False | дом и быт | Набор вешалок | 28 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 0 | True | дом и быт | Ёрш унитазный | 103 |
| 1 | True | дом и быт | Коврик придверный | 98 |
| 2 | True | дом и быт | Щетка-сметка 4-х | 90 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 10 | False | хранение и перевозка | Банка стеклянная | 7 |
| 15 | False | хранение и перевозка | Короб стеллажный | 4 |
| 17 | False | хранение и перевозка | Ящик для | 3 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 0 | True | хранение и перевозка | Сумка-тележка хозяйственная | 111 |
| 1 | True | хранение и перевозка | Сумка-тележка 2-х | 92 |
| 2 | True | хранение и перевозка | Тележка багажная | 65 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 5 | False | кухонные принадлежности | Нож кухонный | 33 |
| 6 | False | кухонные принадлежности | Кружка НОРДИК | 30 |
| 12 | False | кухонные принадлежности | Тарелка суповая | 13 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 0 | True | кухонные принадлежности | Таз пластмассовый | 121 |
| 1 | True | кухонные принадлежности | Тарелка обеденная | 119 |
| 2 | True | кухонные принадлежности | Тарелка десертная | 80 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 8 | False | прочее | Муляж Апельсин | 29 |
| 10 | False | прочее | Муляж Яблоко | 25 |
| 14 | False | прочее | Муляж Красное | 20 |
| sole_order | cat | prod | quantity | |
|---|---|---|---|---|
| 0 | True | прочее | Муляж Яблоко | 124 |
| 1 | True | прочее | Муляж Банан | 105 |
| 2 | True | прочее | Муляж Лимон | 91 |
The differences in popular products of different assortments are as follows:
Thus, the assortments often overlap, but in terms of quantity we can distinguish the following leaders in the main assortment:
# categories frequency
display(df.query('sole_order == False')
.pivot_table(index='order_id', values='cat', aggfunc='nunique')
.reset_index()
.sort_values('cat', ascending=False)
.hist('cat',bins=6)
);
plt.xlabel('category')
plt.ylabel('orders quantity')
plt.title('200 orders contain products of a single category')
plt.show()
array([[<Axes: title={'center': 'cat'}>]], dtype=object)
# frequency diagram of product categories in orders with several unique products
display(df.query('sole_order == False')
.pivot_table(index='cat', values='order_id', aggfunc='nunique')
.reset_index()
.sort_values('order_id', ascending=False)
.plot(x='cat', y='order_id',kind='bar')
);
plt.xlabel('category')
plt.ylabel('goods in pcs')
plt.title('Of the 277 such orders, more than 200 contain plants')
plt.show()
# сколько всего таких заказов?
display(df.query('sole_order == False')['order_id'].nunique())
<Axes: xlabel='cat'>
277
We can see that goods of all categories, excluding plants, in 5 cases out of 6 (or more often) are purchased separately. Plants are purchased separately in 40% of cases.
However, there may be situations where an order contains several unique products, but they all belong to the same category. We found out that there are more than two hundred such orders. Moreover, out of 277 orders that contain 2 or more unique products, 200 contain plants.
Thus, although 60% of plants are purchased together with other goods, these other goods are mostly also plants.
Criterion for conclusion about the significance/insignificance of differences: Non-parametric Wilcoxon-Mann-Whitney test
Level of significance: 0.05
Null hypothesis: According to cleaned data, there are no statistically significant differences in the average revenue between the products of the main and additional ranges
Alternative hypothesis: The differences in average revenue between the main and additional assortment products according to the cleaned data are statistically significant
print('p-value:', "{0:.3f}".format(stats.mannwhitneyu(df[df['sole_order']== True]['revenue'],
df[df['sole_order']== False]['revenue'])[1]))
print('relative difference in average revenue','{0:.3f}'.format(df[df['sole_order']== True]['revenue'].mean()/df[df['sole_order']== False]['revenue'].mean()-1))
p-value: 0.000 relative difference in average revenue 3.764
Criterion for conclusion about the significance/insignificance of differences: Non-parametric Wilcoxon-Mann-Whitney test
Level of significance: 0.05
Null hypothesis: According to cleaned data, there are no statistically significant differences in the average quantity between the products of the main and additional assortments
Alternative hypothesis: The differences in the average quantity between the products of the main and additional assortments according to the cleaned data are statistically significant
print('p-value:', "{0:.3f}".format(stats.mannwhitneyu(df[df['sole_order']== True]['quantity'],
df[df['sole_order']== False]['quantity'])[1]))
print('relative difference in average quantity','{0:.3f}'.format(df[df['sole_order']== True]['quantity'].mean()/df[df['sole_order']== False]['quantity'].mean()-1))
p-value: 0.000 relative difference in average quantity 0.597
In both cases, p-value is less than the significance level, so we accept the hypothesis that there are no differences in both average revenue and the average number of goods sold in the main and additional ranges.
Having analysed the dataset, we came to the following conclusions:
1. Sales data covers 1 year and 1 month: from October 2018 to October 2019.
2. We found gaps in the numbering of orders - a question to technical support about its adequacy ensues.
3. This is a retail store selling goods mainly under 1,000 CU.
4. No full duplicates were found, but there are both orders duplicated at different times and erroneous buyer-order pairing. They took up about 30% of the dataset.
5. Wholesale sales accounted for 4.5% of operations and a third (1.3 mCU) of revenue, including a single purchase of 1,000 units of goods (probably an erroneous entry).
6. Retail sales amounted to 2.8 mCU. Monthly trend indicates a gradual reduction in revenue both in monetary and physical terms. The largest purchases in terms of volume are observed in April and May, during peak plant sales.
7. Sales structure is heterogeneous:
First, more than half of sales come from plants, but they only account for 17% of revenue. It is likely that seeds for planting are not purchased all year round, so it is important to fill storage areas with them just before peak demand.
Secondly, the smallest category, “storage and transportation,” brings in the most (a quarter) of revenue. It is followed by clothing and shoes, and household goods, each of which is responsible for a fifth of revenue.
Finally, kitchen and other products account for only 15% of revenue and for 17% of quantity. Perhaps the store should refrain to purchase goods of categories in favor of more profitable ones.
8. Seasonality is observed in sales of plants (April-May) and clothing and shoes (autumn-winter). Distribution by day of the week is almost uniform, with the bulk of sales occurring in the first half of the day, or from 9 a.m. to 3 p.m. (up to 6 p.m. for plants).
9. 85% of products in all categories, excluding plants, are sold separately. Plants are sold both separately (40%) and together with other products. However, most of these other products are other plants.
Recommendations:
1. Prepare unified instructions for order operators on how to enter data into the database so that there are no omissions or erroneous orders.
2. It may be worth coming up with profitable offers for wholesale buyers: there are few of them, but they account for third of the revenue.
3. Make adjustments to the procurement and warehouse management system. Plants take up the most space, but are sold almost only in spring. Accordingly, in autumn-winter it would be possible to free up more space for other categories of goods: clothes and shoes needed in cold weather, and containers that bring in more revenue.
4. The purchase of kitchen and other goods should be reduced or abandoned in favor of more profitable ones.
5. Consider a system of discounts when ordering goods in the evening to relieve operators.